Compiling Linguistic Constraints into Finite State Automata
Authors: M. Constant and D. Maurel
Abstract
This paper deals with linguistic constraints encoded in the form of (binary) tables, generally called lexicon-grammar tables. We describe a unified method to compile sets of tables of linguistic constraints into Finite State Automata. This method has been practically implemented in the linguistic platform Unitex.

1 Motivation

Finite State Models have been intensively used in Natural Language Processing [13]. Nevertheless, because of the complexity of languages, it is often more convenient for linguists to describe linguistic constraints with simpler and more ergonomic representations. For instance, simple regular expressions are sometimes used to express morphological rules [6], inflected forms in dictionaries are preferably written in a textual form [3], and syntactic constraints depending on the lexicon are represented in the form of binary matrices [4]. Finite State linguistic phenomena are sometimes described with more powerful and more compact formalisms such as (weighted) context-free grammars [10] and recursive transition networks [5]. These representations are then compiled into Finite State Automata or Transducers in order to optimize processing.

This paper deals with linguistic constraints encoded in the form of (binary) tables made of rows and columns, generally called lexicon-grammar tables. A row of such a table is the formal description of the lexical and syntactic properties accepted by a lexical item. Each column corresponds to a property. At the intersection of a row and a column, the encoded value indicates whether or not a lexical entry (row) accepts a property (column)¹. In this paper, we describe a unified method to compile sets of tables of linguistic constraints into Finite State Automata. We also show how it has been practically implemented in the linguistic platform Unitex [11].
¹ Usually, the symbol + stands for True and the symbol − stands for False.

O.H. Ibarra and H.-C. Yen (Eds.): CIAA 2006, LNCS 4094, pp. 242–252, 2006. © Springer-Verlag Berlin Heidelberg 2006. Author manuscript (hal-00637272, version 1, 31 Oct 2011), published in "11th International Conference on Implementation and Application of Automata (CIAA'06), Taipei : Taiwan, Province Of China (2006)".

2 State-of-the-Art

The idea of combining binary matrices and automata was first pointed out in [7], but the first compilation method was proposed in [12] and has been implemented in the linguistic platforms INTEX [14] and Unitex [11]. It was limited to systems of constraints encoded in one table, such as the ones in [4]. It used hand-built parameterized reference automata, representing the set of possible syntactic constructions that a fictive lexical entry accepting all properties of the table can enter. Each path is parameterized by one or several parameters that refer to properties corresponding to syntactic constructions (e.g. Prep Det Noun²) or to lexical information (e.g. whether the constituent Prep accepts the lexical value in).

The compilation process consists, for each lexical entry (or row), in resolving the parameters according to the encoding in the table. For instance, a false value in a given column indicates that the transitions labeled with the parameter associated with that column must be removed. A true value indicates that these transitions must be turned into epsilon-transitions. A specific automaton is thus constructed for each lexical entry. The automaton representing all described phenomena is simply the union of all constructed automata. It is then optimized by determinization and minimization for text processing efficiency.
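The parameter resolution step described above can be pictured with a small Python sketch. It is only an illustration under stated assumptions, not the implementation of [12], INTEX or Unitex: the reference automaton, the parameter names @N0V and @N0VN1, the table values and the replacement of the label V by the lexical entry are all hypothetical. The union is represented as a plain set of accepted sequences, and the final determinization and minimization are left out.

```python
EPSILON = None  # label used for epsilon-transitions

# Hypothetical reference automaton as (source, label, target) triples.
# "@N0V" and "@N0VN1" are made-up parameter names, one per table column.
REFERENCE = [
    (0, "@N0V", 1), (1, "N0", 2), (2, "V", 5),
    (0, "@N0VN1", 3), (3, "N0", 4), (4, "V", 6), (6, "N1", 5),
]
START, FINALS = 0, {5}

# Hypothetical one-table encoding: one row per lexical entry.
TABLE = {
    "eat": {"@N0V": True, "@N0VN1": True},
    "walk": {"@N0V": True, "@N0VN1": False},
}


def resolve(transitions, row, entry):
    """Resolve the parameters of the reference automaton for one row."""
    resolved = []
    for source, label, target in transitions:
        if label in row:
            if row[label]:
                # True: the parameterized transition becomes an epsilon-transition.
                resolved.append((source, EPSILON, target))
            # False: the transition is removed (nothing is appended).
        elif label == "V":
            # Assumption of this sketch: V is replaced by the entry itself.
            resolved.append((source, entry, target))
        else:
            resolved.append((source, label, target))
    return resolved


def language(transitions, start, finals):
    """Enumerate the label sequences accepted by an acyclic automaton."""
    accepted = set()
    stack = [(start, ())]
    while stack:
        state, labels = stack.pop()
        if state in finals:
            accepted.add(" ".join(labels))
        for source, label, target in transitions:
            if source == state:
                step = labels if label is EPSILON else labels + (label,)
                stack.append((target, step))
    return accepted


def compile_table(table):
    """Union of the per-entry languages (minimization is left out)."""
    result = set()
    for entry, row in table.items():
        result |= language(resolve(REFERENCE, row, entry), START, FINALS)
    return result


print(sorted(compile_table(TABLE)))
```

With these made-up values the script prints ['N0 eat', 'N0 eat N1', 'N0 walk']: the path guarded by @N0VN1 disappears for walk because its starting transition was removed.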
Several linguistic studies have shown that it is sometimes more convenient to encode the constraints of a single linguistic phenomenon into systems of multiple tables, because some properties can be factorized into different tables to avoid duplicated encoding [7,1]. In this case, Roche's compilation method does not work because it does not handle multiple tables.

[8] implemented an algorithm compiling systems of multiple tables of specific constraints. These constraints were limited to very local ones: the tables described the restrictions on the combinations of pairs of lexical elements in sequences where both elements occur consecutively (or sometimes with a grammatical word in between). For instance, for French time expressions, the sequence milieu de matin (middle of morning) is forbidden while the sequence milieu d'après-midi (middle of afternoon) is accepted. A schemata automaton is used to represent all possible patterns for a type of expression. This automaton also recognizes bad sequences because it does not take lexical restrictions into account. All forbidden sequences encoded in the tables are put in an automaton that is then applied using the failure algorithm [9], which cuts all forbidden paths in the schemata automaton.

[2] proposed an algorithm with no restrictions on the constraints; constraints were represented in relational systems of tables. The algorithm consisted in directly constructing the automaton that recognizes accepted sequences, by using a parameterized reference automaton with parameters resembling Roche's. Nevertheless, the complexity of the construction of the parameterized automaton could grow very fast with the number of tables. For instance, it is not well adapted to Maurel's time expressions.

In this paper, we present a unified algorithm for compiling systems of tables of constraints with no restrictions on the type of constraints.
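The effect of the forbidden-sequence approach of [8] described above can be illustrated on a finite list of candidate sequences. The Python sketch below is a simplification under stated assumptions: it filters an explicit list instead of cutting paths in an automaton, it does not implement the failure algorithm [9], and the names GRAMMATICAL, FORBIDDEN, CANDIDATES, is_forbidden and keep_accepted are hypothetical. Only the two French examples come from the text.

```python
# Grammatical words that may stand between the two constrained elements.
GRAMMATICAL = {"de", "d'"}

# Forbidden pairs of lexical elements, as they would be read from the tables.
FORBIDDEN = {("milieu", "matin")}

# Sequences recognized by a (hypothetical) schemata automaton, which
# ignores lexical restrictions and therefore over-generates.
CANDIDATES = [
    ("milieu", "de", "matin"),
    ("milieu", "d'", "après-midi"),
]


def is_forbidden(tokens):
    """True if a forbidden pair occurs consecutively or around one grammatical word."""
    for i, first in enumerate(tokens):
        if i + 1 < len(tokens) and (first, tokens[i + 1]) in FORBIDDEN:
            return True
        if (
            i + 2 < len(tokens)
            and tokens[i + 1] in GRAMMATICAL
            and (first, tokens[i + 2]) in FORBIDDEN
        ):
            return True
    return False


def keep_accepted(sequences):
    """Drop every candidate sequence that contains a forbidden pair."""
    return [tokens for tokens in sequences if not is_forbidden(tokens)]


for tokens in keep_accepted(CANDIDATES):
    print(" ".join(tokens))
```

Only the sequence milieu d' après-midi survives the filter; milieu de matin is rejected because the pair (milieu, matin) is listed as forbidden.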
² Prep Det Noun stands for the construction preposition determiner noun.

3 Set of Constraints and Parameterized Automaton

This section gives a general description of the inputs of our algorithm, namely a set of linguistic constraints and a parameterized schemata automaton. They are described in section 3.1 and section 3.2 respectively.

3.1 Sets of Linguistic Constraints

A syntactic construction is a sequence of syntactic symbols (and sometimes of lexical symbols): for instance, the syntactic construction N0 V N1 is composed of a noun phrase (N0) followed by a verb (V) and then another noun phrase (N1). Each syntactic symbol has a set of possible lexical realizations, e.g. V could be eat or walk. However, syntactic constructions have lexical restrictions; their acceptability can depend on the lexical realization of a syntactic element. For example, the transitive verb eat can enter the construction N0 V N1, while the intransitive verb walk cannot³:

John is eating an apple.
*John is walking an apple.

Such a constraint is called a one-dimensional constraint because it depends on only one element (the verb). There can also be lexical restrictions on the combination of two syntactic elements in the context of a construction. For instance, in the construction N0 V N1 Prep N2⁴, there are lexical constraints on the pair (V,Prep): the pairs (receive,from) and (give,to) are acceptable, while (receive,to) and (give,from) are forbidden, as shown in the sentences below.

John received a present (*to+from) Mary.
John gave a present (to+*from) Mary.

Such a constraint is called a two-dimensional constraint because it depends on the combination of two elements (the verb-preposition pair). In practice, a given constraint is not limited to a single construction, but applies to a set of equivalent constructions.
For instance, the constraint on the pair (V,Prep) in the example above also holds for the equivalent interrogative construction:

Who received a present (*to+from) Mary?

Moreover, linguistic constraints can also restrict the combination of more than two syntactic elements cooccurring in the same construction. Theoretically, such constraints can be decomposed into elementary one-dimensional and two-dimensional constraints, all combined with logical AND operators. For example, the acceptability of frozen constructions of the form N0 be Prep N Prep1 N1 can depend on the lexical combination of Prep, N and Prep1, as in:

The text is (in+*on) contradiction (with+*to) the law.

Verifying that this constraint is satisfied is equivalent to checking that the elements in and contradiction can cooccur in this context and that contradiction and with can cooccur. Thus, in the next sections, we will consider that there exist only one-dimensional and two-dimensional constraints. One-dimensional constraints are encoded in the form of binary vectors, each element corresponding to a lexical value; two-dimensional lexical constraints are encoded in the form of binary matrices, encoding the restrictions on the combination of pairs of lexical values.

³ In linguistic examples, the symbol * marks a forbidden form and the symbol + is the disjunction symbol.
⁴ N0, N1 and N2 are noun phrases, V is a verb and Prep a preposition.

Examples of such representations are given in figure 1. The binary representations describe lexical constraints on geographical names. Such names can enter two constructions, Detc Npr Nc (labeled NN) and Detc Nc of Npr (labeled NPN), where Detc is a definite determiner (e.g. the), Npr is a proper name such as Adriatic, Marmara, Paris... and Nc is a location noun classifier like city, sea...
Figure 1(a) presents two-dimensional constraints between lexical realizations of Nc and Npr (sea and Adriatic); figure 1(b) and figure 1(c) present one-dimensional constraints depending on Npr, indicating whether or not it can enter the constructions NPN (city of Paris) or NN (Adriatic sea).

Fig. 1. One- and two-dimensional constraints: (a) Names-Classifiers, (b) NPN constraint, (c) NN constraint

3.2 Parameterized Schemata Automaton

A parameterized schemata automaton is a hand-built acyclic automaton that explicitly represents all possible syntactic realizations of the studied linguistic phenomenon. It is used as a basis to build an automaton representing all accepted constructions of this phenomenon, taking the encoded lexical restrictions into account. Each path represents a possible construction. Labels of this automaton are either lexical or syntactic elements, or parameters.

Syntactic elements that may cause lexical constraints in the construction are marked as parameters. They are called syntactic parameters. Such parameters are denoted by the name of the syntactic element preceded by the symbol @: for instance, @X is the parameter associated with the syntactic symbol X.

Sets of constructions (i.e. sets of paths) can also be parameterized because their acceptability may depend on the lexical realizations of some "parameterized" syntactic elements. We call them construction parameters. They are denoted by the label assigned to the set of constructions, preceded and followed by the symbol @: for example, @P@ is the parameter associated with the constructions labeled P.

An example of such an automaton is given in figure 2: it is the parameterized schemata automaton used for geographical names. @Nc@ and @Npr@ are syntactic parameters; @NN@ and @NPN@ are construction parameters.
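To make the two inputs of section 3 concrete, here is a minimal Python sketch of how the constraints on geographical names might be held in memory and checked by brute force. It is not the compilation algorithm of this paper: the dictionaries NAMES_CLASSIFIERS, CONSTRUCTIONS_OK and SCHEMATA and the function accepted_names are hypothetical, "the" stands in for Detc, and the table cells are invented for illustration, except for the examples named in the text (sea and Adriatic, city of Paris, Adriatic sea).

```python
# Two-dimensional constraint (binary matrix) on (Nc, Npr) pairs.
# Only the pair (sea, Adriatic) is named in the text; other cells are invented.
NAMES_CLASSIFIERS = {
    ("sea", "Adriatic"): True,
    ("sea", "Paris"): False,
    ("city", "Adriatic"): False,
    ("city", "Paris"): True,
}

# One-dimensional constraints (binary vectors) on Npr, one per construction.
CONSTRUCTIONS_OK = {
    "NN": {"Adriatic": True, "Paris": False},
    "NPN": {"Adriatic": False, "Paris": True},
}

# The two constructions as paths; "Nc" and "Npr" are the slots to fill.
SCHEMATA = {
    "NN": ["the", "Npr", "Nc"],
    "NPN": ["the", "Nc", "of", "Npr"],
}


def accepted_names(schemata, constructions_ok, pairs_ok):
    """List the names that satisfy both kinds of constraints (logical AND)."""
    names = []
    for label, path in schemata.items():
        for (nc, npr), pair_ok in pairs_ok.items():
            if pair_ok and constructions_ok[label].get(npr, False):
                slots = {"Nc": nc, "Npr": npr}
                names.append(" ".join(slots.get(symbol, symbol) for symbol in path))
    return sorted(names)


print(accepted_names(SCHEMATA, CONSTRUCTIONS_OK, NAMES_CLASSIFIERS))
```

With these illustrative values the script prints ['the Adriatic sea', 'the city of Paris']. A name is kept only when the two-dimensional constraint on the pair and the one-dimensional constraint on the construction are both true, which mirrors the AND decomposition described in section 3.1.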
Similar resources
Compiling Simple Context Restrictions with Nondeterministic Automata
This paper describes a non-conventional method for compiling (phonological or morpho-syntactic) context restriction (CR) constraints into non-deterministic automata in finite-state tools and surface parsing systems. The method reduces any CR into a simple one that constrains the occurrences of the empty string and represents right contexts with co-deterministic states. In cases where a fully ...
Compiling and Using Finite-State Syntactic Rules
A language-independent framework for syntactic finite-state parsing is discussed. The article presents a framework, a formalism, a compiler and a parser for grammars written in this formalism. As a substantial example, fragments from a nontrivial finite-state grammar of English are discussed. The linguistic framework of the present approach is based on a surface syntactic tagging scheme by F....
Polynomial-Time Reformulations of LTL Temporally Extended Goals into Final-State Goals
Linear temporal logic (LTL) is an expressive language that allows specifying temporally extended goals and preferences. A general approach to dealing with general LTL properties in planning is by “compiling them away”; i.e., in a pre-processing phase, all LTL formulas are converted into simple, non-temporal formulas that can be evaluated in a planning state. This is accomplished by first genera...
Reduction of Computational Complexity in Finite State Automata Explosion of Networked System Diagnosis (RESEARCH NOTE)
This research puts forward rough finite state automata, which have been represented by two variants of BDD called ROBDD and ZBDD. The proposed structures have been used in networked system diagnosis and can overcome combinatorial explosion. In the implementation, the CUDD (Colorado University Decision Diagrams) package is used. A mathematical proof of the claimed complexity is provided, which shows ZBDD ...
Compiling Apertium morphological dictionaries with HFST and using them in HFST applications
In this paper we aim to improve the interoperability and re-usability of the morphological dictionaries of the Apertium machine translation system by formulating a generic finite-state compilation formula that is implemented in the HFST finite-state system to compile Apertium dictionaries into general purpose finite-state automata. We demonstrate the use of the resulting automaton in FST-based spell-checki...